Text mining has been applied to a vast variety of texts and research goals, including linguistics (for an overview, see Biber and Reppen 2015), narratology (Mahmoudi and Abbasalizadeh 2018), and political science (Wilkerson and Casas 2017). Though the computational toolkit is ever-expanding, ranging from simple techniques to sophisticated machine learning algorithms, a number of steps are employed in the vast majority of projects of this kind.
In what follows, I will apply a subset of the techniques prominent in those approaches to an ongoing message log on the Facebook-owned platform WhatsApp, spanning one entire year – highlighting along the way natural ways of combining the different techniques in order to bolster the results each brings forth individually. Specifically, following data preparation, I will first analyse message- and conversation-level statistics such as message length and frequency, focusing on points of contrast between the two speakers in the corpus and on message clusters on the time axis, with a focus on time of day, day of the week, and month.
Following that, regular expression queries will be employed to identify typical themes and their distribution and status in the conversation, as well as n-grams to carve out common multi-word sequences, both those unique to one speaker and those shared between both. In a final step, the entire corpus will be analysed using sentiment analysis, focusing on basic emotions on the one hand and simple emotional valence (or polarity) on the other. Combined, all of these techniques will enable us to generate a clear picture of how both speakers structure and compose their contributions to shared conversational goals and how each diverges from the other.
# load packages
packages <- c(
# general
"tidyverse", "forcats", "here", "psych", "janitor",
"lubridate", "syn", "broom", "stringr",
# rmarkdown related
"kableExtra", "styler", "knitr",
# plotting
"gganimate", "scales", "patchwork",
"scico", "hrbrthemes",
# corpus and text processing
"wordcloud", "tm", "syuzhet", "ngram"
)
xfun::pkg_attach(packages, install = TRUE)
# set global output options
options(
scipen = 999, width = 100, max.print = 999
)
# seed for random number generation tasks
set.seed(1234)
The message log that forms our object of study was obtained using the export option provided within the iPhone app. Note that, due to an update earlier in the year and following a legal decision by German courts, this functionality no longer exists for German users of the app.1 Because of this, I will be analyzing an exported message log that was created before the loss of that functionality.
In order to make the messages usable for the operations to follow, the text file (see the listing below for an example of the structure of the exported messages) needs to be cleaned. Though most of the information we will need is present in the file (date, time, sender, and message data are exported as a single line per message), it is more convenient to work with data matrices.
[10.09.18, 17:13:16] Andrew: Sweet
[10.09.18, 17:13:18] Andrew: Sounds good
[10.09.18, 17:13:25] Andrew: Will bring my camera
[10.09.18, 17:14:15] Maik: Awesome! Do you want to meet up there or should I come pick you up at some bus station?
[10.09.18, 17:14:28] Andrew: Where do you live again?
Below, I have detailed the Python 3 (https://www.python.org, version 3.8.1) script that was used to convert the raw text file into a csv file, which will later be imported into R and processed further. In particular, regular expressions were applied to separate the different kinds of information into separate lists, each associated with only one type of data (date, time, sender, or message text). As a final step, the lists were combined into a csv file using a pandas (The pandas development team 2020; McKinney 2010) dataframe as an intermediate data representation.
import pandas as pd
import re
# empty lists that we will populate by finding regular expression patterns
# in the raw data. In the final step, we will combine the lists into a data
# frame which will be our final data for use in R
msg_date = []
msg_time = []
msg_sender = []
msg = []
# open the chat log in UTF-8 format
# open the chat log in UTF-8 format
with open('data_raw/_chat.txt', 'r', encoding='utf-8') as f:
    string = f.readlines()
# iterate over the entire log to identify different regex patterns
for row in range(1, len(string)):
    # the date pattern we want to look for
    date_pattern = r'(\d+\.\d+\.\d+)'
    # if the date is found, add it to the list, if not add NA
    try:
        date = re.search(date_pattern, string[row]).group(0)
    except AttributeError:
        date = "NA"
    msg_date.append(date)
    # same process for time stamps
    time_pattern = r'\d+:\d+:\d+'
    try:
        time = re.search(time_pattern, string[row]).group(0)
    except AttributeError:
        time = "NA"
    msg_time.append(time)
    # now find the senders of the individual messages
    person_pattern = r'[\]]\s\w+'
    try:
        # use the entire match but delete the closing square bracket
        person = re.search(person_pattern, string[row]).group(0).replace("] ", "")
    except AttributeError:
        person = "NA"
    msg_sender.append(person)
    # and, finally, the messages themselves
    msg_pattern = r'(:\s).*'
    try:
        # delete the colon and the space that follows it
        message = re.search(msg_pattern, string[row]).group(0).replace(": ", "")
    except AttributeError:
        message = "NA"
    msg.append(message)
# combine the lists into a data frame with named columns
df = pd.DataFrame(list(zip(msg_date, msg_time, msg_sender, msg)),
                  columns=['date', 'time', 'sender', 'message'])
# export the data frame as a csv file
df.to_csv("data/messages_cleaned.csv", index=False)
Following the preprocessing, the resulting file was read into R (R Core Team 2020). The code listing below shows the data in its form after the preprocessing and before further computational steps were performed.
# import data from python script output
d <- read.csv(here("data", "messages_cleaned.csv"), as.is = TRUE)
data.frame(
variable = names(d),
class = sapply(d, typeof),
first_values = sapply(d, function(x) {
paste0(head(x, 3),
collapse = ", "
)
}),
row.names = NULL
)
At this stage, the entire corpus totals 33848 messages. Next, all rows containing missing values were eliminated from the data set, leaving 33579 rows of complete data. Missing values were generated whenever no date, sender, or message could be identified. As far as I can tell, this occurred whenever media files of any kind (images, video files, voice messages) were sent or when non-message events such as WhatsApp-internal calls took place. In light of this, these rows can be deleted safely without affecting any of the following analyses in a meaningful way.
At this stage, all we can really see is that one speaker, Andrew, seems to have sent quite a few more messages than the other, as shown in the table below. However, further analyses might lead us to revise this conclusion. After all, this majority might consist largely of short, low-word-count utterances.
Because of the way that R handles dates internally, namely with a designated date format, some conversions into that format are in order before we can proceed to draw conclusions from our data. This will come into play not only when visualizing the data, but also when creating further subdivisions of it. Specifically, below, columns are added that identify the month the message was sent as well as the corresponding day of the week.
# data format conversions
# adding date, time, month and week day information
d <- d %>%
mutate(
date = dmy(date),
time = hour(hms(time)),
month = months(date, abbreviate = TRUE),
month = factor(month,
levels = c(
"Sep", "Oct", "Nov", "Dec",
"Jan", "Feb", "Mar", "Apr",
"May", "Jun", "Jul", "Aug"
)
),
day = wday(date, label = TRUE),
day = factor(day,
levels = c(
"Mon", "Tue", "Wed",
"Thu", "Fri", "Sat", "Sun"
)
)
)
The table below features the state of the data after these additions.
Now, we are in a position to take a first look at the data in aggregation and maybe even revise some of our earlier results. Formed from a conversation between two individuals over a prolonged period of time, this corpus allows us to check various statistics concerning who contributed to it and in what way.
# length of messages and message count per sender
d <- d %>%
mutate(length = nchar(message)) %>%
group_by(sender) %>%
mutate(count = length(message)) %>%
ungroup()
# count all the words
d$word <- str_count(d$message, boundary("word"))
# amount of words in the corpus
sum(d$word)
[1] 167270
The code above allows for the following conclusions: Andrew sent 19937 messages, while Maik sent only 13642.2 However, the longest message, at 76 words, was sent by Maik, who also surpasses Andrew in average message length: 4.4042 words for Andrew versus 5.825 for Maik.
The visualization below gives a more detailed impression of the message length distribution, showing a substantial skew to the right, in favor of shorter messages.
# word summary
words <- tabyl(d$word) %>%
adorn_pct_formatting() %>%
head(15)
ggplot(words, aes(x = `d$word`, y = n)) +
geom_line(color = "#98D4D4FF") +
scale_x_continuous(breaks = pretty_breaks(n = 15)) +
ylim(0, max(words$n)) +
labs(y = "Occurrences", x = "Message Length in Words")
A more detailed overview of how each speaker contributed to the conversation overall and in times of extreme (in)activity is shown in the following tables, the first of which deals with character length, the latter with (white)space-delimited character sequences.
Summary statistics generated using Revelle (2018).
In the visualizations below (all of which, including the ones to follow, were created using the ggplot2 package; Wickham 2016), the number of messages is broken down over the span of the entire year as well as by sender. The lower panels allow an inspection of the distribution over months and weekdays. Notably, the hours of the night as well as the weekends feature less prominently than other time frames.
msgct1 <- ggplot(d, aes(x = date, color = sender)) +
stat_bin(geom = "line") +
labs(
x = "Time of Year",
y = "Message Count",
color = "Sender"
) +
theme(
legend.position = c(0.85, 0.95),
axis.text.x = element_text(angle = 60, hjust = 1)
) +
scale_x_date(breaks = pretty_breaks(n = 12)) +
scale_color_manual(values = c("#98D4D4FF", "#FF929AFF"))
msgct2 <- ggplot(d, aes(x = time, fill = sender)) +
geom_histogram(position = "dodge") +
labs(
x = "Time of Day",
y = "Message Count",
fill = "Sender"
) +
theme(
legend.position = c(0.3, 0.95),
legend.direction = "horizontal"
) +
scale_fill_manual(values = c("#98D4D4FF", "#FF929AFF"))
msgct3 <- ggplot(d, aes(x = day, fill = sender)) +
geom_bar(position = "dodge") +
labs(
x = "Weekday",
y = "Message Count",
fill = "Sender"
) +
theme(legend.position = c(0.75, 0.95)) +
scale_fill_manual(values = c("#98D4D4FF", "#FF929AFF"))
msgct1 /
(msgct2 | msgct3)
This finding is further confirmed by the precise percentages shown in the table below:
These visualizations, instead of detailing the message count, feature the message length in characters. Note that the averages shown in the sparse areas, e.g., the hours of the night, are less reliable since outlying messages (of starkly different length than the rest) will affect measures of central tendency disproportionately. Once those areas of considerable uncertainty are excluded, however, one can clearly see that the overall message length for each speaker as well as the difference between them seems fairly constant.
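The disproportionate pull of outliers on measures of central tendency can be illustrated with a quick, self-contained sketch (in Python, with invented message lengths; the numbers are purely illustrative):

```python
import statistics

# invented character counts for a dense daytime hour and a sparse night hour
busy_hour = [12, 18, 25, 9, 30, 14, 22, 16, 11, 20]
night_hour = [10, 15, 400]  # two short messages plus one very long outlier

# in the dense sample, mean and median agree closely
print(statistics.mean(busy_hour), statistics.median(busy_hour))   # 17.7 17.0

# in the sparse sample, the single outlier drags the mean far above the
# median, which is why averages over sparse areas are unreliable
print(statistics.mean(night_hour), statistics.median(night_hour))
```

This is the reason the nightly averages in the plots below should be read with caution: a single long message in an otherwise quiet hour dominates that hour's mean.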
msgln1 <- ggplot(d, aes(x = date, y = length, color = sender)) +
stat_summary(fun = mean, geom = "line") +
labs(
x = "Time of Year",
y = "Message Length",
color = "Sender"
) +
theme(
legend.position = c(0.3, 0.95),
legend.direction = "horizontal",
axis.text.x = element_text(angle = 60, hjust = 1)
) +
scale_x_date(breaks = pretty_breaks(n = 12)) +
scale_color_manual(values = c("#98D4D4FF", "#FF929AFF"))
msgln2 <- ggplot(d, aes(x = time, y = length, color = sender)) +
stat_summary(fun = mean, geom = "line") +
labs(
x = "Time of Day",
y = "Mean Character and Word Amount Per Message",
color = "Sender"
) +
stat_summary(aes(y = word),
fun = mean, geom = "line",
linetype = "dashed"
) +
theme(
legend.position = c(0.5, 0.95),
legend.direction = "horizontal"
) +
scale_color_manual(values = c("#98D4D4FF", "#FF929AFF"))
msgln1 /
msgln2
A further point of exploration concerns sentence types and their distribution. Before going into the results, however, a few words of caution are in order. As this corpus is comprised of informal text, relying, as I do here, on graphemic demarcation of sentence types is tentative at best. This holds for declarative clauses in particular, as messages rarely end in periods in instant-messaging environments. As such, without adding to the computational complexity of the present project, results for declaratives should not be taken as representative of the text. Scanning the text for question and exclamation marks, however, should allow us to check at least the difference in distribution between the two.
# analyse sentence types including proportions
sentence_types <- paste(tolower(d$message), collapse = " ")
# add space before sentence type punctuation
sentence_types <- gsub("\\!", " !", sentence_types)
sentence_types <- gsub("\\?", " ?", sentence_types)
sentence_types <- gsub("\\.", " .", sentence_types)
# split at space to generate individual words
sentence_types <- data.frame(strsplit(sentence_types, " "))[, 1]
# subset out only ?, ! and .
sentence_types <- factor(sentence_types[sentence_types %in%
c("?", "!", "."))
The table below confirms the remarks made at the outset of this section: declarative utterances form the overwhelming minority, contrary to what we would expect from other kinds of texts. If the goal were to gather reliable data, as opposed to simply exploring the limits of simple text mining approaches, a modification of the regular expression patterns would be in order. Or, failing that, the results would have to be discarded. As it stands, we can observe that, under the chosen approach, imperative/exclamative and interrogative utterances are used here in relative equilibrium (though, of course, the same caveats regarding graphematic demarcation apply to these as well).
# compute the overall occurrences as well as their ratio
sentence_types <- tabyl(sentence_types)
names(sentence_types) <- c("sentence", "occurrences", "percent")
sentence_types$percent <- percent(sentence_types$percent)
sentence_types
In this section, I wanted to identify different types of messages and their impact on the entire text body. To do this, eight look-up patterns were defined according to the type of message they were designed to find. One such pattern (sorry.txt) is exemplified below; the seven remaining ones can be found in the assets directory of the repository for this document.
# contents of assets/sorry.txt
sorry|apologize|entschuldigung|tut mir leid
Next, a function was defined that imports one such pattern from its text file and runs a case-insensitive regular expression search over all of the messages in the corpus. The resulting counts for each of these searches were added in a separate column of the data frame.
# function to read in the patterns from the text files with regex patterns
# and count the occurrences in the message column in form of a vector
# with each value representing count per row (message)
input_pattern <- function(file) {
fullfile <- here("assets", paste0(file, ".txt"))
pattern <- readChar(fullfile, file.info(fullfile)$size)
str_count(d$message, regex(pattern, ignore_case = TRUE))
}
# look for specific message type and create a counting column
# for all eight patterns
d <- d %>%
mutate(
ily = input_pattern("ily"),
miss = input_pattern("miss"),
baby = input_pattern("pet"),
sex = input_pattern("sex"),
sorry = input_pattern("sorry"),
insult = input_pattern("insult"),
tired = input_pattern("tired"),
drag = input_pattern("drag")
)
The table below shows the distribution of all of the included message types.
# check the overall amounts of the new columns
d %>%
gather(
"type", "count", ily, miss, baby, sex,
insult, sorry, tired, drag
) %>%
group_by(type) %>%
summarise(count = sum(count)) %>%
arrange(desc(count)) %>%
mutate(prop = percent(count / sum(count))) -> dlong
dlong
Broken down by speaker, we can see that Andrew’s messages are represented much more heavily in the regular expression search patterns. Though we are not in a position to judge why exactly that might be, there are a few potential candidates. For one, Andrew might make less varied use of such expressions, leading to high yields, while Maik might employ many variants, of which perhaps only a few are contained in the pattern lists. A second explanation might be that Maik simply makes more spelling errors, for one reason or another, which would similarly lead to low recall for the message-type identification strategy employed here.
d %>%
gather(
"type", "count", ily, miss, baby, sex,
insult, sorry, tired, drag
) %>%
group_by(type, sender) %>%
summarise(count = sum(count)) %>%
arrange(desc(count))
The plot below highlights one message type (I-love-you-type messages and related displays of affection) across all of the dimensions of the corpus we have considered so far, showing the mean occurrence rate per message over the entirety of the year under study, separately for each month, and by day of the week.
ilyplt1 <- ggplot(d, aes(x = date, y = ily, color = sender)) +
stat_summary(fun = mean, geom = "line") +
labs(
x = "Time of Year",
y = "I-Love-You Type Messages",
color = "Sender"
) +
scale_color_manual(values = c("#98D4D4FF", "#FF929AFF")) +
coord_cartesian(ylim = c(0, 2)) +
theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
scale_x_date(breaks = pretty_breaks(n = 12))
ilyplt2 <- ggplot(d, aes(x = month, y = ily, color = sender, group = sender)) +
stat_summary(fun = mean, geom = "line") +
stat_summary(fun = mean, geom = "point") +
scale_color_manual(values = c("#98D4D4FF", "#FF929AFF")) +
labs(
x = "Month",
y = "I-Love-You Type Messages"
) +
guides(color = FALSE)
ilyplt3 <- ggplot(d, aes(x = day, y = ily, color = sender, group = sender)) +
stat_summary(fun = mean, geom = "line") +
stat_summary(fun = mean, geom = "point") +
scale_color_manual(values = c("#98D4D4FF", "#FF929AFF")) +
stat_summary(
fun.data = mean_se, geom = "errorbar",
width = 0.25, alpha = .5, linetype = 1,
color = "#98D4D4FF"
) +
labs(
x = "Weekday",
y = "I-Love-You Type Messages"
) +
guides(color = FALSE)
ilyplt1 /
(ilyplt2 | ilyplt3)
The following visualizations detail four of the message types by their mean occurrence per message for each speaker separately.
d %>%
gather("type", "count", ily, miss, baby, sex) %>%
ggplot(aes(x = month, y = count, color = type, group = type)) +
stat_summary(fun = mean, geom = "line") +
labs(
x = "Month",
y = "Message Count",
color = "Type",
group = "Type"
) +
scale_color_manual(values = scico(4, palette = "batlow")) +
theme(
legend.position = c(0.52, 0.95),
legend.direction = "horizontal",
axis.text.x = element_text(angle = 60, hjust = 1)
) +
facet_wrap(~sender)
Gif generated using Pedersen and Robinson (2020).
d %>%
gather(
"type", "count", ily, baby, miss, sex,
insult, sorry, tired, drag
) %>%
mutate(month = factor(month, levels = month.abb)) %>%
count(type, date, month, wt = count) %>%
ggplot(aes(reorder(type, -n), n, color = type)) +
geom_line(
mapping = aes(group = 1),
stat = "summary", fun = mean
) +
geom_jitter(alpha = .3) +
geom_point(stat = "summary", fun = mean, size = 3) +
stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.35) +
labs(
title = "Mean of Message Type Occurrence in: {closest_state}",
x = "Message Type",
y = "Mean Occurrence"
) +
guides(color = FALSE) +
coord_cartesian(ylim = c(0, 15)) +
scale_color_scico_d(palette = "batlow") +
transition_states(month,
state_length = 3,
transition_length = 5
)
In this section, we will try to identify something that might be called each speaker’s lexical signature, i.e., his most used words. To do this, all punctuation and other superfluous material will be removed from the object of study using regular expressions. Further, we will work with three message collection objects: one that contains all of the messages, and two that contain only the messages sent by one speaker each. To achieve this, we will use the tm package (Feinerer, Hornik, and Meyer 2008).
# create one long string with all the messages
messages_all <- paste(tolower(d$message), collapse = " ")
# remove some strings that would hinder further analyses
messages_all <- gsub(",|\\(|)|\\.|:|;|\\!|\\?", "", messages_all)
messages_all <- gsub("[[:punct:]]", "", messages_all)
# create a long string for each of us
messages_all_a <- paste(tolower(d$message[d$sender == "Andrew"]),
collapse = " "
)
messages_all_m <- paste(tolower(d$message[d$sender == "Maik"]),
collapse = " "
)
After separating the messages, we need to create the virtual corpora and prepare them further for analysis. One crucial step is the removal of so-called stop words. These are words that we are not particularly interested in when comparing the personal lexicon of each speaker, namely closed-class lexical items that tend to be the most frequent terms in texts at any rate, simply by virtue of their role in grammar. Additionally, there are some artifacts which, I assume, are due to text encoding that occurs when WhatsApp exports message logs. Specifically, contraction and possessive markers are encoded in such a way that the preprocessing functions in tm do not recognize them. To combat this, they were removed manually.
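As a sketch of what goes wrong and of one possible alternative to listing every mangled variant by hand: the exports appear to use the typographic apostrophe (U+2019) where stop-word lists expect the ASCII one, so normalizing it first would let the standard lists match. The following Python function is mine and not part of the original pipeline:

```python
def normalize_apostrophes(text: str) -> str:
    """Replace the typographic apostrophe (U+2019) with the ASCII one."""
    return text.replace("\u2019", "'")

# contractions now use the plain apostrophe that standard English
# stop-word lists contain
print(normalize_apostrophes("don\u2019t let\u2019s couldn\u2019t"))
# don't let's couldn't
```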
Coming back to tm, the corpora are prepared using the custom function below, which takes care of several issues that would hinder our current efforts; among them removing white space and number tokens.
# assemble the corpora
corpus <- VCorpus(VectorSource(messages_all))
corpus_andrew <- VCorpus(VectorSource(messages_all_a))
corpus_maik <- VCorpus(VectorSource(messages_all_m))
# generate list of words to be removed from the corpora
mystopwords <- c(
stopwords("english"), "'re", "'ll", "omit", "didn'",
"’ve", "’ll", "didn’", "didn’", "don’", "’re",
"image", "video", "doesn’", "omit", "omitted", "can’t",
"isn’t", "wasn’t", "let’s", "haven’t", "won’t", "couldn’t"
)
mystopwords <- setdiff(mystopwords, c("love"))
# function to perform some cleanup operations
prep_corpus <- function(c) {
c <- tm_map(c, content_transformer(tolower))
c <- tm_map(c, removeWords, mystopwords)
c <- tm_map(c, removeNumbers)
c <- tm_map(c, removePunctuation)
c <- tm_map(c, stripWhitespace)
# seems to mess everything up, so I do not stem this corpus
# c <- tm_map(c, stemDocument)
c
}
# use function to prep all three corpora
corpus <- prep_corpus(corpus)
corpus_andrew <- prep_corpus(corpus_andrew)
corpus_maik <- prep_corpus(corpus_maik)
After all of the corpora have been preprocessed, we can inspect them one by one and generate word cloud visualizations using the wordcloud package (Fellows 2018). The visualizations to follow feature the most commonly occurring words overall, for Andrew, and for Maik, respectively. As is clearly evident, there is a large amount of overlap in the terms constituting the majority of the conversational contributions of both.
# generate a document-term matrix
dtm_all <- DocumentTermMatrix(corpus)
# length should be total number of terms (i.e., unique words)
corpfreq <- colSums(as.matrix(dtm_all))
length(corpfreq)
[1] 9409
cols <- c(
"#1F7A80FF", "#79C4B2FF", "#98D4D4FF",
"#FF929AFF", "#FF6359FF", "#9C5568FF"
)
wordcloud(corpus,
max.words = 150, random.order = FALSE,
min.freq = 5, rot.per = 0, use.r.layout = TRUE,
colors = cols, scale = c(3, .7)
)
dtm_a <- DocumentTermMatrix(corpus_andrew)
# length should be total number of terms
corpfreq <- colSums(as.matrix(dtm_a))
length(corpfreq)
[1] 6494
wordcloud(corpus_andrew,
max.words = 150, random.order = FALSE,
min.freq = 5, rot.per = 0, use.r.layout = TRUE,
colors = cols, scale = c(3, .7)
)
dtm_m <- DocumentTermMatrix(corpus_maik)
# length should be total number of terms
corpfreq <- colSums(as.matrix(dtm_m))
length(corpfreq)
[1] 6011
wordcloud(corpus_maik,
max.words = 150, random.order = FALSE,
min.freq = 5, rot.per = 0, use.r.layout = TRUE,
colors = cols, scale = c(3, .7)
)
Going into a bit more detail, one interesting conclusion from this section is the fact that both speakers employ roughly 6000 non-excluded distinct words (though forms might be a better term) in their messages. Overall, however, about 9000 words remain after the preprocessing steps, which suggests that the vocabularies of the two speakers differ quite drastically, with roughly 3000 words per speaker belonging only to that speaker’s active lexical repertoire – at least in the conversation under consideration.
The lists below, again, feature a roundup of frequent lexical items and give the clear impression that there is a large amount of overlap. In order to identify each speaker’s unique lexical choices, we will make use of complement sets in this section. Specifically, we will be looking for tokens in the corpus that are only present for one of the speakers. To make the lists less susceptible to typos and occasionalisms, only tokens with a minimum frequency of 5 will be considered.
# list words that occur at least 150 times for each corpus
findFreqTerms(dtm_a, lowfreq = 150)
[1] "also" "baby" "back" "bby" "bit" "bus" "can" "class"
[9] "come" "cus" "day" "get" "going" "gonna" "good" "got"
[17] "haha" "hahaha" "heading" "hehe" "hehehe" "home" "hope" "just"
[25] "kikiki" "later" "like" "love" "lunch" "maybe" "might" "morning"
[33] "much" "nice" "night" "now" "okay" "one" "place" "really"
[41] "right" "see" "sleep" "soon" "still" "think" "time" "today"
[49] "tomorrow" "wanna" "will" "yeah" "yep" "yes"
findFreqTerms(dtm_m, lowfreq = 150)
[1] "baby" "can" "day" "done" "fine" "fun" "get" "going" "gonna"
[10] "good" "got" "great" "haha" "hahaha" "home" "hope" "just" "know"
[19] "later" "like" "love" "morning" "much" "need" "nice" "night" "now"
[28] "okay" "one" "really" "right" "see" "sleep" "soon" "sorry" "still"
[37] "super" "sure" "think" "though" "time" "wanna" "well" "will" "work"
[46] "yeah"   "yup"
Though only a few examples are shown above, the interested reader can find the full sets of unique tokens in the data directory of this repository, where they are exported as unique_andrew.txt and unique_maik.txt, respectively.
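The complement-set logic itself is simple; a Python sketch with invented token counts (the names and numbers here are illustrative only, not taken from the corpus) shows the idea:

```python
from collections import Counter

# invented per-speaker token counts for illustration
freq_a = Counter({"bus": 7, "camera": 6, "love": 20, "good": 15, "typo": 1})
freq_m = Counter({"work": 9, "super": 8, "love": 18, "good": 12, "ocasion": 2})

MIN_FREQ = 5  # filter out typos and occasionalisms, as in the text

def vocab(freq):
    """Keep only tokens that occur at least MIN_FREQ times."""
    return {w for w, n in freq.items() if n >= MIN_FREQ}

# complement sets: tokens used (often enough) by only one of the speakers
unique_a = vocab(freq_a) - vocab(freq_m)
unique_m = vocab(freq_m) - vocab(freq_a)

print(sorted(unique_a))  # ['bus', 'camera']
print(sorted(unique_m))  # ['super', 'work']
```

Note how the frequency threshold drops both the one-off typo and the misspelled occasionalism before the set difference is taken.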
Going beyond single words, we will now process n-grams using the ngram package (Schmidt and Heckendorf 2017). As ngram only works with continuous strings, not with virtual corpora created with tm or with vectorized data, we will have to preprocess the messages once more. This is done with a custom function which, again, removes some artifacts probably caused by media messages.
# function to preprocess the string
prep <- function(x) {
gramstring <- preprocess(x,
case = "lower",
remove.punct = TRUE
)
gramstring <- gsub(
"omit|omitted|image", "",
gramstring
)
gramstring
}
# preprocessing for overall and per sender strings
gramstring <- prep(messages_all)
gramstring_a <- prep(messages_all_a)
gramstring_m <- prep(messages_all_m)
The tables highlight that, despite the substantial amount of lexical variation between the speakers, both predominantly make use of short words, averaging about 4 characters each. Again, this is probably due to the informal nature of WhatsApp conversations and the existence of closed-class lexical items, which tend to be quite short, like a, the, and the various inflected forms of be.
# generate a summary
# for all messages combined
a <- string.summary(gramstring)
data.frame(a[1], a[2], a[6], av_word_length = a[[2]] / a[[6]])
# process trigrams
tri <- ngram(gramstring, n = 3)
tri_a <- ngram(gramstring_a, n = 3)
tri_m <- ngram(gramstring_m, n = 3)
# extract top trigrams
tridat <- head(get.phrasetable(tri), 15)
tridat_a <- head(get.phrasetable(tri_a), 15)
tridat_m <- head(get.phrasetable(tri_m), 15)
Having processed the trigrams, we can now visualize which ones are the most common. The plot below shows, in the upper panel, the most prevalent trigrams overall and, in the lower panels, the results for each speaker in isolation. Apart from simply showing the trigrams, this also confirms that the message-type analyses using regular expressions captured very ubiquitous themes in the conversation at hand: the first four trigrams in the overall panel, for example, are clear instantiations of the types of messages caught by the I-love-you-type pattern, and there are instances of other message types as well. Insofar, then, both analyses substantiate each other by arriving at similar results via different approaches.
tri_all <- ggplot(tridat, aes(x = reorder(ngrams, freq), y = freq)) +
geom_bar(stat = "identity", fill = "#98D4D4FF") +
coord_flip() +
labs(x = "Trigrams", y = "Count")
tri_a <- ggplot(tridat_a, aes(x = reorder(ngrams, freq), y = freq)) +
geom_bar(stat = "identity", fill = "#98D4D4FF") +
coord_flip() +
labs(x = "Trigrams", y = "Count", title = "Andrew")
tri_m <- ggplot(tridat_m, aes(x = reorder(ngrams, freq), y = freq)) +
geom_bar(stat = "identity", fill = "#98D4D4FF") +
coord_flip() +
labs(x = "Trigrams", y = "Count", title = "Maik")
tri_all /
(tri_a | tri_m)
One further feature introduced with the ngram package is the possibility of generating text from the base message string that has the same statistical properties as the n-grams that were found, using Markov chains as a sampling scheme.3 Obviously, the generated strings are not always grammatical (or even intelligible), but in light of text-generation and conversational-computing projects like ELIZA (Weizenbaum 1966), it is an interesting functionality. The tables below feature ten instances of sampled text of five tokens each for both speakers.
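The Markov-chain sampling idea can be sketched in a few lines (shown here in Python as a simplified bigram-based generator over a toy string; this illustrates the general scheme, not the ngram package's actual implementation):

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words observed to follow it."""
    words = text.lower().split()
    chain = defaultdict(list)
    for w1, w2 in zip(words, words[1:]):
        chain[w1].append(w2)
    return chain

def babble(chain, start, n_tokens, seed=None):
    """Sample a word sequence by repeatedly picking a random successor."""
    rng = random.Random(seed)
    out = [start]
    while len(out) < n_tokens:
        successors = chain.get(out[-1])
        if not successors:  # dead end: no observed successor
            break
        out.append(rng.choice(successors))
    return " ".join(out)

# toy input: frequent bigrams yield proportionally more likely successors
chain = build_chain("i love you so much i love you too i miss you")
print(babble(chain, "i", 5, seed=1))
```

Because successors are stored with multiplicity, frequent bigrams are sampled proportionally more often, which is what gives the generated strings a statistical profile similar to the source text.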
As a final step in the analysis, I will be analyzing the messages in the data set in terms of the emotions conveyed by them, as indicated by the results of the algorithms implemented in the syuzhet package (Jockers 2015). The potential categories include anger, disgust, anticipation, and joy, as well as the more general categories positive and negative, dubbed here polarity. The latter part, though also clearly sentiment analysis and automatically computed with the syuzhet package, will additionally be implemented by using large word lists (the positive category totals about 2000 entries, the negative one approaches 5000) and looking for occurrences of these items in the message log.
As with the steps before, sentiments will be investigated three times: once for each of the speakers and once for the overall data set – the last of which is achieved by combining the individual speakers’ datasets to save computation time.
# do the sentiment analysis for Andrew's messages
d_andrew <- filter(d, sender == "Andrew")
sent_andrew <- get_nrc_sentiment(d_andrew$message)
# sentiment analysis for Maik's messages
d_maik <- filter(d, sender == "Maik")
sent_maik <- get_nrc_sentiment(d_maik$message)
# combine the two datasets into a complete sentiment representation
sent <- sent_andrew %>%
mutate(sender = "Andrew") %>%
bind_rows(sent_maik %>%
mutate(sender = "Maik"))
# and save it
write.csv(
sent, here("data", "sentiments.csv"),
row.names = FALSE, quote = FALSE
)
The tables below feature the results of these computations and show that a surprising number of sentiments could be identified despite the informal nature of the body of text. This result could be interpreted in two ways: either both speakers use fairly standard written English even though their conversation takes place in an informal, more non-standard setting, or the procedure for identifying the sentiments is sophisticated enough to handle even substandard exchanges.
# check the overall amounts of the new columns
sent_andrew %>%
bind_rows(sent_maik) %>%
summarise_all(list(sum))
From the visualizations below, it is evident that although the absolute amounts differ, the relative ranking of the emotions is the same for both speakers. Though interpreting these data in aggregated form is not without peril, one could infer that both speakers seem to have experienced and expressed the same emotions over the course of the conversation. Note, of course, that anything stronger would have to be backed by further investigations, e.g., by binning the data and running the sentiment analysis once more. However, as this is beyond the scope of the current project, I will not explore this avenue any further.
plot_sent_all <- sent %>%
gather("type", "count", everything()) %>%
group_by(type) %>%
summarise(count = sum(as.numeric(count))) %>%
arrange(desc(count)) %>%
ggplot(aes(x = reorder(type, count), y = count)) +
geom_bar(stat = "identity", fill = "#98D4D4FF") +
coord_flip() +
labs(x = "Sentiment", y = "Count") +
scale_y_continuous(breaks = pretty_breaks(n = 12))
plot_sent_a <- sent_andrew %>%
gather("type", "count", everything()) %>%
group_by(type) %>%
summarise(count = sum(as.numeric(count))) %>%
arrange(desc(count)) %>%
ggplot(aes(x = reorder(type, count), y = count)) +
geom_bar(stat = "identity", fill = "#98D4D4FF") +
coord_flip() +
labs(title = "Andrew", x = "Sentiment", y = "Count") +
scale_y_continuous(
breaks = pretty_breaks(n = 12),
limits = c(0, 6000)
)
plot_sent_m <- sent_maik %>%
gather("type", "count", everything()) %>%
group_by(type) %>%
summarise(count = sum(as.numeric(count))) %>%
arrange(desc(count)) %>%
ggplot(aes(x = reorder(type, count), y = count)) +
geom_bar(stat = "identity", fill = "#98D4D4FF") +
coord_flip() +
labs(title = "Maik", x = "Sentiment", y = "Count") +
scale_y_continuous(
breaks = pretty_breaks(n = 12),
limits = c(0, 6000)
)
plot_sent_all /
(plot_sent_a | plot_sent_m)
This section presents an alternative form of sentiment analysis based on word lists that are freely available online for determining the polarity of conversational contributions. In contrast to the approach above, we will not check for individual emotions but focus exclusively on the positive-negative dichotomy. Some examples from the positive and negative categories are shown below; the full lists can be found in the assets directory of this document’s repository.
# read in the word lists to base the sentiment analysis on
neg_words <- scan(here("assets", "negative.txt"),
sep = "\n", what = "char"
)
pos_words <- scan(here("assets", "positive.txt"),
sep = "\n", what = "char"
)
head(pos_words, 15)
[1] "a+" "abound" "abounds" "abundance" "abundant" "accessable"
[7] "accessible" "acclaim" "acclaimed" "acclamation" "accolade" "accolades"
[13] "accommodative" "accomodative" "accomplish"
head(neg_words, 15)
[1] "2-faced" "2-faces" "abnormal" "abolish" "abominable" "abominably"
[7] "abominate" "abomination" "abort" "aborted" "aborts" "abrade"
[13] "abrasive" "abrupt" "abruptly"
# compute the positive and negative scores based on the word lists
# note: this assumes that `message` holds tokenized word vectors;
# raw strings would first need to be split, e.g. with strsplit(x, "\\s+")
d$n_pos <- sapply(d$message, USE.NAMES = FALSE, function(x) {
length(x[x %in% pos_words])
})
d$n_neg <- sapply(d$message, USE.NAMES = FALSE, function(x) {
length(x[x %in% neg_words])
})
For both plotting and statistical exploration, some simple measures were computed below that relate each polarity count to the message’s word count and to the opposite end of the polarity spectrum. The results are shown in the next table.
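To make these measures concrete, consider a toy message; the two mini word lists here are stand-ins for illustration, not entries from the actual lexicon files.

```r
# hypothetical six-word message and mini word lists
words <- c("great", "fun", "but", "awful", "traffic", "today")
toy_pos <- c("great", "fun")
toy_neg <- c("awful")
pos_ratio <- sum(words %in% toy_pos) / length(words) # 2/6, approx. 0.33
neg_ratio <- sum(words %in% toy_neg) / length(words) # 1/6, approx. 0.17
sent_val <- pos_ratio - neg_ratio # approx. 0.17, i.e. mildly positive
```

Normalizing by word count keeps long messages from dominating the scores, and the difference of the two ratios yields a single signed valence value per message.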
# create new data set that contains a subset
# where each message is at least one word long
dsent <- subset(d, word > 0)
# compute some simple stats
dsent$pos_ratio <- dsent$n_pos / dsent$word
dsent$neg_ratio <- dsent$n_neg / dsent$word
dsent$sent_val <- dsent$pos_ratio - dsent$neg_ratio
# aggregate the data for plotting
info_df <- aggregate(word ~ month, data = dsent, mean)
info_df <- merge(info_df,
aggregate(pos_ratio ~ month, data = dsent, mean),
by = "month"
)
info_df <- merge(info_df,
aggregate(neg_ratio ~ month, data = dsent, mean),
by = "month"
)
info_df <- merge(info_df,
aggregate(pos_ratio ~ month, data = dsent, sciplot::se),
by = "month"
)
info_df <- merge(info_df, aggregate(
neg_ratio ~ month,
data = dsent, sciplot::se
),
by = "month"
)
info_df <- merge(info_df, aggregate(sent_val ~ month, data = dsent, mean),
by = "month"
)
info_df <- merge(info_df, aggregate(sent_val ~ month,
data = dsent,
sciplot::se
), by = "month")
names(info_df)[3:8] <- c(
"mean_pos_ratio", "mean_neg_ratio",
"se.pos_ratio", "se.neg_ratio",
"mean.sent_val", "se.sent_val"
)
info_df$sent_pol <- ifelse(info_df$mean.sent_val > 0, "positive", "negative")
As the visualization below shows, the sentiments expressed in the messages are largely stable in terms of their proportions, though their rate seems to increase over the course of the year. What is more, both speakers convey similar emotions and almost seem to mirror each other (as hinted at before). Though this is expected to a degree – after all, if one speaker is, say, aggressive, the other often is as well – the small amount of inter-speaker variation, particularly in light of all the sentiments analyzed, is surprising.
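As an aside, the aggregate()/merge() chain above can be expressed more compactly with dplyr. The sketch below should yield the same columns; the se() helper mirrors sciplot::se, i.e. the standard error of the mean.

```r
# compact dplyr equivalent of the aggregate/merge chain (sketch)
se <- function(x) sd(x) / sqrt(length(x))
info_df <- dsent %>%
group_by(month) %>%
summarise(
word = mean(word),
mean_pos_ratio = mean(pos_ratio), se.pos_ratio = se(pos_ratio),
mean_neg_ratio = mean(neg_ratio), se.neg_ratio = se(neg_ratio),
mean.sent_val = mean(sent_val), se.sent_val = se(sent_val)
) %>%
mutate(sent_pol = ifelse(mean.sent_val > 0, "positive", "negative"))
```

A single grouped summarise avoids the repeated joins on `month` and makes the column names explicit at the point of computation.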
# add sentiment analysis to main data set
d <- cbind(d, sent[1:(length(sent) - 1)])
sent_month <- d %>%
gather(
"sentiment", "count", positive, negative, trust, surprise,
sadness, joy, fear, disgust, anticipation, anger
) %>%
arrange(desc(count)) %>%
ggplot(aes(
x = month, y = count, color = reorder(sentiment, -count),
group = reorder(sentiment, count)
)) +
geom_line(stat = "summary", fun = mean) +
theme(legend.position = "bottom") +
labs(
x = "Month", y = "Mean Sentiment Count",
color = "Sentiment", group = "Sentiment"
) +
scale_color_manual(values = scico(10, palette = "batlow")) +
guides(
color = guide_legend(ncol = 6),
group = guide_legend(ncol = 6)
) +
facet_wrap(~sender, ncol = 1)
sent_month
As for the categorization into positive and negative sentiments using the simpler word-list approach, we can clearly observe that positive ones invariably form the vast majority, though the distribution itself seems almost erratic when broken down by month, as the graph below shows. To investigate these patterns further, a more detailed analysis would be needed, possibly one that uses all the previously discussed tools in a more targeted, hypothesis-driven way. Since such an in-depth look goes beyond the intentions of the current study, any interpretation here would be mere speculation.
# separate polarity based on the month
sentiment_df <- data.frame(
month = rep(info_df$month, 2),
value = c(
info_df$mean_pos_ratio,
info_df$mean_neg_ratio * -1
),
errors = c(
info_df$se.pos_ratio,
info_df$se.neg_ratio
),
polarity = rep(c(
"positive", "negative"
),
each = nrow(info_df)
)
)
sentiment_df$polarity <- factor(sentiment_df$polarity, levels = c(
"positive", "negative"
))
pol_both <- ggplot(data = sentiment_df, aes(
x = month, y = value, fill = polarity
)) +
geom_bar(stat = "identity") +
geom_errorbar(aes(ymin = value - errors, ymax = value + errors),
width = 0.25, alpha = .5
) +
labs(
y = "Mean sentiment value", x = "Month", fill = "Polarity"
) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_fill_manual(values = c("#98D4D4FF", "#FF929AFF"))
pol_overall <- ggplot(data = info_df) +
aes(
x = reorder(month, mean.sent_val),
y = mean.sent_val,
fill = sent_pol
) +
geom_bar(stat = "identity") +
geom_errorbar(aes(
ymin = mean.sent_val - se.sent_val,
ymax = mean.sent_val + se.sent_val
),
width = 0.25, alpha = .5
) +
labs(y = "Mean sentiment value", x = "Month") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_fill_manual(values = c("#FF929AFF", "#98D4D4FF")) +
guides(fill = FALSE) +
coord_flip()
pol_both /
pol_overall
Summing up, let me recapitulate and bring together the – at first sight – very disparate approaches and results of the corpus-linguistic methods applied over the course of this study. Despite the heterogeneity of the techniques, there were points of contact at which one approach substantiated the results of another. For example, both the n-gram analyses and the word lists obtained through the virtual corpora suggest an overall similar vocabulary between the two speakers and, in turn, naturally extended the attempt to identify message topics using regular expressions.
Within all of those results, there was a large degree of similarity between the two speakers; a finding that was again confirmed by the sentiment analysis, where both speakers were found to convey the same range of emotions and, more surprisingly, to the same degree. In this respect, all the analyses delivered similar impressions of the data under study, if from different angles. A wealth of techniques thus enables not only the investigation of different facets of the data but also offers a great deal of corroboration.
The points of contrast, on the other hand, were less overarching and generally only manifested when more fine-grained measures were applied (for example, the unique speaker-level vocabulary). On this view, contrastive studies, especially when the sought-after discrepancies are general, should employ a variety of approaches in order to improve the chances of identifying them.
# export final dataset excluding the message texts
d %>%
select(-message) %>%
write.table(here("data", "messages_full.csv"),
row.names = FALSE, quote = FALSE
)
# unique words
write.table(
maik_unique, here("data", "unique_maik.txt"),
row.names = FALSE, quote = FALSE
)
write.table(
andrew_unique, here("data", "unique_andrew.txt"),
row.names = FALSE, quote = FALSE
)
# sentence types
write.table(sentence_types, here("data", "sentence_types.csv"),
row.names = FALSE, quote = FALSE
)
# message types
write.table(dlong, here("data", "message_types.csv"),
row.names = FALSE, quote = FALSE
)
# sentiment analysis
write.table(info_df, here("data", "sentiment_stats.csv"),
row.names = FALSE, quote = FALSE
)
xfun::session_info(dependencies = FALSE)
R version 4.0.0 (2020-04-24)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.5
Locale: en_US.UTF-8 / en_US.UTF-8 / en_US.UTF-8 / C / en_US.UTF-8 / en_US.UTF-8
Package version:
ngram_3.0.4 syuzhet_1.0.4 tm_0.7-7 NLP_0.2-0
wordcloud_2.6 RColorBrewer_1.1-2 hrbrthemes_0.8.0 scico_1.1.0
patchwork_1.0.0 scales_1.1.1 gganimate_1.0.5 styler_1.3.2
kableExtra_1.1.0 broom_0.5.6 syn_0.1.0 lubridate_1.7.8
janitor_2.0.1 psych_1.9.12.31 forcats_0.5.0 stringr_1.4.0
dplyr_0.8.5 purrr_0.3.4 readr_1.3.1 tidyr_1.0.3
tibble_3.0.1 ggplot2_3.3.0 tidyverse_1.3.0 knitr_1.28
here_0.1 nlme_3.1-147 fs_1.4.1 sf_0.9-4
webshot_0.5.2 progress_1.2.2 httr_1.4.1 rprojroot_1.3-2
R.cache_0.14.0 tools_4.0.0 backports_1.1.7 R6_2.4.1
KernSmooth_2.23-17 DBI_1.1.0 colorspace_1.4-1 withr_2.2.0
tidyselect_1.1.0 prettyunits_1.1.1 mnormt_1.5-7 compiler_4.0.0
extrafontdb_1.0 cli_2.0.2 rvest_0.3.5 xml2_1.3.2
sciplot_1.2-0 labeling_0.3 slam_0.1-47 classInt_0.4-3
systemfonts_0.2.1 digest_0.6.25 rmarkdown_2.1 R.utils_2.9.2
pkgconfig_2.0.3 htmltools_0.4.0 extrafont_0.17 dbplyr_1.4.3
rlang_0.4.7 readxl_1.3.1 rstudioapi_0.11 farver_2.0.3
generics_0.0.2 jsonlite_1.6.1 R.oo_1.23.0 magrittr_1.5
Matrix_1.2-18 Rcpp_1.0.5 munsell_0.5.0 fansi_0.4.1
reticulate_1.16 gdtools_0.2.2 lifecycle_0.2.0 R.methodsS3_1.8.0
stringi_1.4.6 yaml_2.2.1 snakecase_0.11.0 plyr_1.8.6
grid_4.0.0 parallel_4.0.0 crayon_1.3.4 lattice_0.20-41
haven_2.2.0 transformr_0.1.2.9000 hms_0.5.3 magick_2.4.0
pillar_1.4.4 lpSolve_5.6.15 codetools_0.2-16 reprex_0.3.0
glue_1.4.1 evaluate_0.14 modelr_0.1.7 vctrs_0.3.0
tweenr_1.0.1 Rttf2pt1_1.3.8 cellranger_1.1.0 gtable_0.3.0
rematch2_2.1.2 assertthat_0.2.1 xfun_0.13 e1071_1.7-3
class_7.3-17 viridisLite_0.3.0 units_0.6-7 ellipsis_0.3.1
Biber, Douglas, and Randi Reppen, eds. 2015. The Cambridge Handbook of English Corpus Linguistics. Cambridge Handbooks in Language and Linguistics. Cambridge: Cambridge University Press.
Feinerer, Ingo, Kurt Hornik, and David Meyer. 2008. “Text Mining Infrastructure in R.” Journal of Statistical Software 25 (5): 1–54. http://www.jstatsoft.org/v25/i05/.
Fellows, Ian. 2018. Wordcloud: Word Clouds. https://CRAN.R-project.org/package=wordcloud.
Jockers, Matthew L. 2015. Syuzhet: Extract Sentiment and Plot Arcs from Text. https://github.com/mjockers/syuzhet.
Mahmoudi, Mohammad Reza, and Ali Abbasalizadeh. 2018. “How Statistics and Text Mining Can Be Applied to Literary Studies?” Digital Scholarship in the Humanities 34 (3): 536–41. https://doi.org/10.1093/llc/fqy069.
McKinney, Wes. 2010. “Data Structures for Statistical Computing in Python.” In Proceedings of the 9th Python in Science Conference, edited by Stéfan van der Walt and Jarrod Millman, 56–61. https://doi.org/10.25080/Majora-92bf1922-00a.
Pedersen, Thomas Lin, and David Robinson. 2020. gganimate: A Grammar of Animated Graphics. https://CRAN.R-project.org/package=gganimate.
R Core Team. 2020. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.
Revelle, William. 2018. psych: Procedures for Psychological, Psychometric, and Personality Research. Evanston, Illinois: Northwestern University. https://CRAN.R-project.org/package=psych.
Schmidt, Drew, and Christian Heckendorf. 2017. Guide to the ngram Package: Fast N-Gram Tokenization. https://cran.r-project.org/package=ngram.
The pandas development team. 2020. Pandas-Dev/Pandas: Pandas. Zenodo. https://doi.org/10.5281/zenodo.3509134.
Weizenbaum, Joseph. 1966. “ELIZA—a Computer Program for the Study of Natural Language Communication Between Man and Machine.” Communications of the ACM 9 (1): 36–45. https://doi.org/10.1145/365153.365168.
Wickham, Hadley. 2016. Ggplot2 – Elegant Graphics for Data Analysis. New York: Springer.
Wilkerson, John, and Andreu Casas. 2017. “Large-Scale Computerized Text Analysis in Political Science: Opportunities and Challenges.” Annual Review of Political Science 20 (1): 529–44. https://doi.org/10.1146/annurev-polisci-052615-025542.
See https://t3n.de/news/whatsapp-kein-chat-export-mehr-1238327/.
Note that words here are simply defined as collections of characters delimited by whitespace characters, thus including emojis and other markers of informal conversation types.
Note that this process is not fully reproducible, as ngram employs its own internal random-number sampling process, rendering R’s set.seed() ineffective.